Multiprocessing packet switching connection system having provision
for error correction and recovery.



A large number of processing elements (604) (e.g. 4096) are
interconnected by means of a high bandwidth switch (606). Each
processing element (604) includes one or more general purpose
microprocessors (1202), a local memory (1210) and a DMA controller
(1206) that sends and receives messages through the switch (606)
without requiring processor intervention. The switch (606) that
connects the processing elements is hierarchical and comprises a
network of clusters. Sixty-four processing elements (604) can be
combined to form a cluster, and sixty-four clusters can be linked by
way of a Banyan network. Messages are routed through the switch (606)
in the form of packets which include a command field, a sequence
number, a destination address, a source address, a data field (which
can include subcommands), and an error correction code. Error
correction is performed at the processing elements. If a packet is
routed to a non-present or non-functional processor, the switch (606)
reverses the source and destination fields and returns the packet to
the sender with an error flag. If the packet is misrouted to a
functional processing element (604), the processing element (604)
corrects the error and retransmits the packet through the switch (606)
over a different path. In one embodiment, each processing element can
be provided with a hardware accelerator for database functions. In
this embodiment, the multiprocessor of the present invention can be
employed as a coprocessor to a 370 host and used to perform database
functions.
MULTIPROCESSING PACKET SWITCHING CONNECTION SYSTEM HAVING PROVISION
FOR ERROR CORRECTION AND RECOVERY



BACKGROUND OF THE INVENTION 

a. FIELD OF THE INVENTION 

This invention relates to the field of multiprocessing systems and
error recovery in multiprocessing systems.

b. RELATED ART 

A multiprocessing system (MPS) is a comput- 
ing system employing two or more connected pro- 
cessing units to execute programs simultaneously. 
Conventionally, multiprocessing systems have 
been classified into a number of types based on 
the interconnection between the processors. 

A first type of conventional multiprocessing system is the
"multiprocessor" or "shared memory" system (Fig. 1). In a shared
memory system, a number of central processing units 102-106 are
interconnected by the fact that they share a common global memory 108.
Although each central processing unit may have a local cache memory,
cross cache validation makes the caches transparent to the user and
the system appears as if it only has a single global memory.

Shared memory systems also take the form of multiple central
processing units sharing multiple global memories through a connection
network. An example of such a system is an Omega network (Fig. 2). In
an Omega network a plurality of switches S01-S24 organized into stages
route data between a plurality of processors P0-P7 and a plurality of
global memories M0-M7 by using a binary destination tag generated by a
requesting processor. Each stage of switches in the network decodes a
respective bit of the tag to make the network self-routing. The Omega
network thereby avoids the need for a central controller.
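The destination-tag routing described above can be illustrated with a minimal sketch. It assumes the textbook Omega construction for eight ports (a perfect shuffle between stages, with each switch consuming one tag bit, most significant first); the function name `omega_route` is hypothetical and the wiring of Fig. 2 may differ in detail.

```python
def omega_route(src: int, dst: int, n_bits: int = 3) -> list[int]:
    """Return the line position after each stage when routing src -> dst.

    Assumes an N = 2**n_bits port Omega network built from 2x2 switches.
    """
    mask = (1 << n_bits) - 1
    pos = src
    path = []
    for stage in range(n_bits):
        # Perfect shuffle between stages: rotate the position bits left by one.
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & mask
        # The switch decodes one bit of the destination tag (MSB first)
        # and forces the low bit of the position to that value.
        tag_bit = (dst >> (n_bits - 1 - stage)) & 1
        pos = (pos & ~1) | tag_bit
        path.append(pos)
    return path


print(omega_route(5, 2))  # [2, 5, 2]
```

Because every position bit has been replaced by a tag bit after the last stage, the packet ends at `dst` regardless of `src`, which is why no central controller is needed.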

A common characteristic of shared memory systems is that access time
to a piece of data in the memory is independent of the processor
making the request. A significant limitation of shared memory systems
is that the aggregate bandwidth of the global memory limits the number
of processors that can be effectively accommodated on the system.

A second type of commonly known multi- 
processing system is the multicomputer message 
passing network (Fig. 3). Message passing net- 
works are configured by interconnecting a number 
of processing nodes. Each node 302-308 includes 
a central processing unit and a local memory that 
is not globally accessible. In order for an application to share data
among processors the programmer must explicitly code commands to move
data from one node to another. In contrast to shared memory systems,
the time that it takes for a processor to access data depends on its
distance (in nodes) from the processor that currently has the data in
its local memory.

In the message passing network configuration of Fig. 3, each node has
a direct connection to every other node. Such configurations are,
however, impractical for large numbers of processors. Solutions such
as hypercube configurations have been conventionally used to limit the
largest distance between processors. In any event, as the number of
processors in the network increases, the number of indirect
connections and resulting memory access times will also tend to
increase.

A third type of multiprocessing system is the hybrid machine (Fig. 4).
Hybrid machines have some of the properties of shared memory systems
and some of the properties of message passing networks. In the hybrid
machine, a number of processors 402-406, each having a local memory,
are connected by way of a connection network 408. Even though all
memories are local, the operating system makes the machine look like
it has a single global memory. An example of a hybrid machine is the
IBM RP3. Hybrid machines can typically provide access to remote data
significantly faster than message passing networks. Even so, data
layout can be critical to algorithm performance and the aggregate
communications speed of the connection network is a limit to the
number of processors that can be effectively accommodated.

A variant on multiprocessing system connection networks is the
cluster-connected network (Fig. 5). In a cluster-connected network, a
number of clusters 502-508, each including a group of processors
510-516 and a multiplexer/controller 518, are connected through switch
network 520. The cluster network has advantages over the topology of
Fig. 4 in that a larger number of processors can be effectively
connected to the switch network through a given number of ports. One
constraint of cluster connected networks is that the bandwidths of
both the cluster controller and the switch are critical to system
performance. For this reason, the design of the switch and cluster
controller are important factors in determining maximum system size
and performance.

SUMMARY OF THE INVENTION 

It is a first object of this invention to improve the performance of
cluster-connected multiprocessing systems.

EP 0 439 693 A2

It is a second object of this invention to provide an efficient system
for hard and soft error recovery in systems connected by way of a
connection network.

It is a third object of this invention to provide a computer system
capable of performing complex ad hoc queries against a relational
database at speeds which are several orders of magnitude faster than
with today's largest mainframe computers.

In accordance with the above objectives there is provided an improved
multiprocessing system and method.

In a first embodiment, an improved cluster controller is provided. The
improved cluster controller includes a switch for distributing packets
received from the processing elements in accordance with a destination
address and packet priority, a global storage, queues for controlling
packet flow to the processing elements, an assembly buffer for
assembling data from the processing elements into packets, and
selection logic for selecting packets from any of the assembly buffer
and the global storage to the switching network.

In a second embodiment, a system and method for recovering from errors
in the destination field of data being transferred between two nodes
of a multiprocessing system having at least three nodes is provided.
When data is misrouted to an improper node due to an error in a
destination address field, the error is detected and corrected. Once
the error is corrected, the data is rerouted to the correct node by
way of an independent data path (i.e. one other than the one on which
it was received). Advantageously, this enables recovery from both soft
and hard errors in the destination address field.

In a third embodiment a multiprocessor network is provided. The
network is architected as a plurality of cluster controllers which
connect groups of processors by way of a switch. The processing
elements each include a local memory which is accessible by each of
the processors in the system.

In a fourth embodiment, a packet format for use in a cluster connected
multiprocessing system is provided. The packet format includes a data
field, source and destination fields, a field that can cause a write
into a global memory of a cluster controller, and error correct/detect
fields.

FEATURES AND ADVANTAGES 

1. The connection network design of the present system employs
mainframe technology to achieve a high bandwidth system
interconnection that is beyond the capabilities of many contemporary
systems. High density packaging enables the use of wide buses (e.g.
130 bits), and high speed bipolar logic allows very high frequency
system clocking (e.g. 5ns). A sustained bandwidth of 200GB/second is
achievable for uniform random message transfers.

2. A DMA Controller in each processing element 
provides efficient transmission of messages 
through a novel packet protocol, which also
enables the direct addressing of non-local 
memories. The latter capability is important for 
some software algorithms that assume a shared 
memory structure, and is also advantageous for 
system debugging and service functions. 

3. The interleaving of packets from multiple 
messages by the DMA controller effectively ran- 
domizes the pattern of packet transmissions and 
is important to achieving maximum bandwidth 
through the switch. 

4. The connection network design for packet 
switching provides efficient message broadcast- 
ing, and global storage for control functions, in 
addition to basic point-to-point message trans- 
mission. 

5. The packet format allows robust error handling. The use of ECC
together with the source (SRC) and destination (DST) identifiers in
every packet permits efficient error correction or handling. If a
hardware error results in the misrouting of a packet, then one of two
cases exists: (1) the packet gets misrouted to a non-existent or
non-operational processing element, in which case the cluster
controller reverses the SRC and DST fields and returns the packet to
its sender with an error flag; or (2) the packet gets misrouted to a
functional processing element, which will retransmit the packet (after
applying ECC as required). Retransmission can overcome soft errors
and, in case 2 above, it can also circumvent some hard failures by
employing a different hardware path.

6. This highly parallel processing structure, with 
its high bandwidth interconnection, is well suited
for a wide variety of applications, some exam- 
ples of which include database processing, logic 
simulation, and artificial intelligence. 

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a block diagram of a prior art shared memory system.

Fig. 2 is a block diagram of a prior art shared memory system
configured using an Omega interconnection network.

Fig. 3 is a block diagram of a prior art message passing network.

Fig. 4 is a block diagram of a prior art hybrid system.

Fig. 5 is a block diagram of a prior art
switch/queues. The interconnection of a typical one of these
switch/queues 802 is illustrated. Each of the input ports on each 8X8
switch/queue is bused to all eight 8X1 switch/queues. Each 8X1
switch/queue can take from 0 to 8 of its inputs (quintword packets)
and enter them into a single FIFO output queue in each cycle of the
network clock. In the same cycle a single packet (the top of queue)
can be taken off the queue and passed on to the next stage of the
switch network or to the final destination. If the queue is empty at
the start of a cycle, a valid input packet can bypass the queue and go
directly to the output, thus saving a cycle which would have been
otherwise wasted in unneeded staging.

Each packet carries with it its own destination 
address. The addressing mechanism provides the 
following function. Only those packets properly ad- 
dressed for the output port represented by a given 
switch queue will actually be enqueued on that
port. In addition, each packet will be enqueued on 
only one queue. The addresses must be such that 
an address corresponds to a unique path between 
a source and destination. Groups of 3 bits within 
each address represent the local addresses within 
each switch. A fixed priority scheme is used to 
determine in what order each of the simultaneous 
input packets is enqueued. Although a more so- 
phisticated scheme could be used, since every 
quintword packet has the opportunity to get on the 
queue on every cycle, the fixed priority scheme is 
inherently a "fair" one (i.e., no single source will 
get more or less than its share of entries on the 
queue, unless other sources have no data for this 
output port.) 

Fig. 9 is a more detailed diagram of the typical switch/queue 802
shown in Fig. 8. Each switch/queue contains a queue 902 of up to 84
packets. Each packet is a quintword (180 bits) in size. Each word
includes 32 bits of data plus 4 bits of ECC. A packet from an input
port is selected by the recognition logic 904 of a single switch/queue
based on the destination address (DST id) which is contained in the
control word portion of the packet. Up to eight packets (one from each
input port) may be enqueued at a given output port during each cycle.
Simultaneously, each output port can select a packet for transmission,
either from its local queue 902, or from short circuit logic 906 which
enables a single input to go directly to the output port register 910
when the queue is empty. Busy logic 908 is provided to prevent
forwarding a packet when a downstream queue is full. This design
prevents an output from appearing to be busy during bursts of
activity, and can thereby avoid propagating the busy condition to
senders.

As an example of operation, let us assume that three of the eight
inputs to the 8X8 switch have valid addresses which direct them to the
second output port. The recognition logic 904 will select those three
addresses to be gated to this part of the switch. If the output port
queue 902 is not empty and is not full, the input packets will be
enqueued. If the output port queue 902 is full, the Busy Logic 908
will prevent the ingating of the packets. If the output port queue 902
is empty, the Short Circuit Logic 906 will take one of the three input
packets, in accord with a conventional priority scheme, and pass it
directly to the output port register 910, at the same time enqueueing
the remaining two packets on the output port queue. The packet in the
Output Port Register 910 will be gated to the next level of the switch
as long as that level is not busy.
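The example above can be mirrored in a small behavioural sketch of one 8X1 switch/queue. It is only an illustration under stated assumptions: the class name `SwitchQueue`, the tuple representation of a packet as `(local_address, payload)`, the ordering of events inside a cycle, and the use of the lowest input index as the fixed priority are all hypothetical choices, not details taken from Fig. 9.

```python
from collections import deque


class SwitchQueue:
    """Behavioural model of one output port: recognition logic, FIFO queue,
    short circuit logic and busy logic (all simplified assumptions)."""

    def __init__(self, port, capacity):
        self.port = port
        self.capacity = capacity
        self.queue = deque()
        self.output_register = None

    def cycle(self, inputs, downstream_busy=False):
        """Run one network cycle.

        inputs: one entry per input port, None or (local_address, payload).
        Returns (forwarded_packet_or_None, refused_input_indices).
        """
        # Gate the output register to the next level unless it is busy.
        forwarded = None
        if self.output_register is not None and not downstream_busy:
            forwarded, self.output_register = self.output_register, None

        # Recognition logic: keep only packets addressed to this port,
        # visited in fixed priority order (lowest input index first).
        selected = [(i, p) for i, p in enumerate(inputs)
                    if p is not None and p[0] == self.port]

        refused = []
        for i, packet in selected:
            if self.output_register is None and not self.queue:
                # Short circuit: the queue is empty, so bypass it.
                self.output_register = packet
            elif len(self.queue) < self.capacity:
                self.queue.append(packet)
            else:
                # Busy logic: the queue is full, the sender keeps the packet.
                refused.append(i)

        # Refill the output register from the top of the queue.
        if self.output_register is None and self.queue:
            self.output_register = self.queue.popleft()
        return forwarded, refused


sq = SwitchQueue(port=1, capacity=4)
sq.cycle([(1, "a"), None, (1, "b"), (3, "x"), (1, "c"), None, None, None])
print(sq.output_register, list(sq.queue))  # (1, 'a') [(1, 'b'), (1, 'c')]
```

With three packets addressed to the same empty port, one goes straight to the output register and the other two are enqueued, as in the text; the packet addressed to port 3 is ignored by this port.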

Fig. 10 is a more detailed illustration of an exemplary one of the
cluster controllers 602(1)-602(32) of Fig. 6. Cluster controller 1
602(1) will be used by way of example. Coming from the second stage of
the switch network (switches 710-716), data received on the input bus
608(1) is routed to a 9 from 6 switch 1002. The 9 from 6 switch 1002
receives six inputs: one from the switching network 606, one from a
global store 1004 and four from a cluster controller assembly buffer
1006. The 9 from 6 switch 1002 distributes the received data (from the
six inputs) to the appropriate "octant" or to the global store 1004.
The global store 1004 can be used for a variety of functions including
sharing status between processing elements, process coordination,
shared algorithm control, and shared data.

In order to route the received data to the appropriate octants the 9
from 6 switch 1002 decodes 3 bits from the internal packet destination
address (DST). Alternatively, the global store 1004 is accessed by the
switch 1002 decoding a global store access command. Any conflicts for
output from the 9 from 6 switch 1002 are resolved with a conventional
priority and round robin scheme. The connection from the switching
network 608(1) always has highest priority. Of the 9 outputs 1010(1-9)
of the 9 from 6 switch 1002, eight are connected to octants of
processing element queues. An exemplary octant is designated by
reference numeral 1008. Each of eight outputs 1010(1)-1010(8) is
connected to an individual octant of this type. Each octant includes
eight processing element queues. Each queue is 16 packets deep and
includes busy/full logic and short circuits for empty queues. Each
octant has only one input (from the 9-from-6 switch) and one output
and enables one read and one write to occur simultaneously.

Each cluster controller 602(1)-602(32) further includes 32 Processing
Element Ports (PEPs) 1012(1)-1012(32). Each processing element port
includes subports to interface with two processing elements. Each
subport includes a two byte output port connected to a corresponding
one of the processing element input busses 612(1-64) and a one byte
input port connected to the corresponding one of the processing
element output busses 614(1-64) for each of two processing elements.
The output of each queue is bused to all four PEPs (for eight
processing elements) in the octant. The PEPs use address decoding to
ingate only those packets which are addressed to the appropriate
processing element. Each PEP includes a packet buffer for the output
port with logic to signal to the octant queues when the buffer is
empty.

Each of the eight octants operates independently, serving one of its
eight PEP buffers one quintword each cycle, if a packet is available.
From the PEPs, the packet is sent across the appropriate processing
element input bus to the addressed processing element, two bytes at a
time. The asymmetry of the input and output buses (one versus two
bytes) helps to prevent queue full conditions.

In the inward direction (i.e. from the processing elements), one byte
of data comes across one of the input buses from a processing element
into the corresponding processing element port (i.e. the PEP to which
the PE is connected). From the processing element port, the incoming
byte of data is routed directly into a port of an assembly buffer 1006
which takes in successive bytes and forms a quintword packet. The
assembly buffer has 64 slots (quintword memory locations)
1014(1)-1014(64). In other words, there is one slot in the assembly
buffer for each processing element, each operating independently and
having its own byte counting and busy logic (not shown).

The assembly buffer slots are arranged into four columns. Each column
has its own round robin logic to select one slot of those which are
complete. Each cycle of the network clock, one quintword packet from
one slot in each column can be outgated. The outgated packets go to
the 9-from-6 switch 1002 and the 1-of-5 selector 1016. A fifth input
to the 1-of-5 selector 1016 comes from the global store 1004. The
1-of-5 selector will, based on address and round robin logic, take one
packet which needs to be routed through the switch network 606 and
send it on its way. Packets which are not successfully gated through
either the 1-of-5 selector or the 9-from-6 switch remain in their
slots to be selected the next time the round robin algorithm allows.

An example of the operation of the cluster controller, under a uniform
distribution of messages, is as follows:

One input from the connected processing elements, a byte per cycle, is
read into each of the assembly buffers. Five quintword packets per
cycle can be outgated to the 1-of-5 selector, so that one quintword
per cycle is sent to another cluster controller.

On the output to PE direction, up to 6 quintword packets can be gated
to up to 9 destinations, with queueing. Assuming a 5ns cycle of the
cluster controller, with a 10ns cycle on the input and output to PE
buses, the cluster controller can input 6.4 GB/sec from the PEs (100
MB/sec/PE). The assembly buffers and global memory can output 12.8
GB/sec, up to 3.2 GB/sec of which can go to other cluster controllers.
Up to 19.2 GB/sec may enter into the output queues, and the output
queues themselves can dispatch up to 28.8 GB/sec to the PEPs and
Global Store. The PEPs each can deliver 200 MB/sec to their respective
PEs, which aggregated would allow up to 12.8 GB/sec to flow out of the
cluster controller to the PEs. While these are peak numbers, they show
that the design is biased to allow a steady stream of 3.2 GB/sec to
flow from PEs to other clusters, and up to 12.8 GB/sec back out to the
PEs. Again, the design is biased to prevent queues from filling and
creating contention upstream in the switch.
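The peak figures quoted above can be reproduced with simple arithmetic. The sketch below is a reading of the numbers rather than something stated in the text: it assumes that only the 16 data bytes (128 bits) of each quintword are counted, and that the 12.8 GB/sec figure corresponds to four quintwords per 5ns cycle.

```python
CLUSTER_CYCLE_NS = 5        # cluster controller cycle (from the text)
PE_BUS_CYCLE_NS = 10        # PE input/output bus cycle (from the text)
PES_PER_CLUSTER = 64
QUINTWORD_DATA_BYTES = 16   # assumption: only the 128 data bits are counted


def gb_per_sec(bytes_per_cycle, cycle_ns):
    """Bytes per nanosecond is numerically equal to GB/sec (10**9 bytes/sec)."""
    return bytes_per_cycle / cycle_ns


rates = {
    # one byte per PE per bus cycle, 64 PEs
    "PEs into cluster controller": gb_per_sec(1 * PES_PER_CLUSTER, PE_BUS_CYCLE_NS),
    # assumed: four quintwords per cluster cycle
    "assembly buffer output": gb_per_sec(4 * QUINTWORD_DATA_BYTES, CLUSTER_CYCLE_NS),
    # one quintword per cycle through the 1-of-5 selector
    "to other cluster controllers": gb_per_sec(1 * QUINTWORD_DATA_BYTES, CLUSTER_CYCLE_NS),
    # six inputs of the 9-from-6 switch
    "into output queues": gb_per_sec(6 * QUINTWORD_DATA_BYTES, CLUSTER_CYCLE_NS),
    # nine outputs of the 9-from-6 switch
    "out of output queues": gb_per_sec(9 * QUINTWORD_DATA_BYTES, CLUSTER_CYCLE_NS),
    # two bytes per PE per bus cycle, 64 PEs
    "PEPs out to PEs": gb_per_sec(2 * PES_PER_CLUSTER, PE_BUS_CYCLE_NS),
}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} GB/sec")
```

Under these assumptions the six values come out as 6.4, 12.8, 3.2, 19.2, 28.8 and 12.8 GB/sec, matching the figures in the paragraph above.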

Fig. 12 shows a preferred embodiment of the processing elements
604(1)-604(2048) of Fig. 6. It should be understood that the present
multiprocessor could use other types of processors as processing
elements. The central processor 1202 of the processing element is
preferably a state of the art RISC microprocessor. It is connected, in
a conventional manner, to the processor cache 1204 which gives fast
access time to instructions and data. The bus from the cache 1204 ties
into a DMA controller 1206. The DMA controller 1206 provides the cache
1204 bidirectional ports to each of the switch buffer 1208 and the
main processing element storage 1210. The switch buffer 1208 is an
input/output buffer which handles the data and protocols to and from
the cluster controller. The cluster controller connects to the
processing element through the switch buffer 1208 by way of two
unidirectional ports connected to individual busses 1212, 1214. The
first unidirectional port handles incoming traffic from the cluster
controller to the processing element while the second unidirectional
port handles outgoing traffic from the processing element to the
cluster controller.

Fig. 13 is a more detailed diagram of the DMA controller 1206 of
Fig. 12. To process incoming messages, a Quintword Assembly Buffer
1302 takes 2 bytes of data at a time from the cluster controller to
processing element bus 1212 and reassembles the packet. The ECC logic
1304 checks and restores the integrity of the data and also checks
whether the packet arrived at the proper destination.

Once the data integrity is verified or corrected and it is determined
that the packet has arrived at its proper destination, the Input
Message Control Logic 1308 places the data on a queue in the PE
storage 1210. This task is accomplished by a Storage Arbitration
Controller 1310, which can handle multiple requests for the PE storage
1210 and can resolve any storage conflicts. The Input Message Control
Logic 1308 then signals the PE microprocessor 1202 that a message is
available.

When the PE microprocessor 1202 wishes to send a message to another
PE, it first enqueues the message on a destination queue in the PE
storage 1210. The microprocessor 1202 then signals the Output Message
Control 1312 that a message is ready. It does this by doing a "store"
operation to a fixed address. This address does not exist in the PE
storage 1210 but is decoded by the Storage Arbitration Control 1310 as
a special signal. The data for the "store" operation points to the
destination queue in PE storage 1210.

Before being sent to the cluster controller, each message in the
destination queue is provided with a header. The headers are kept
locally in the DMA controller 1206 in the destination PE Q-header
array 1314. The message header specifies the total length of the
message in bytes (up to 4096), the id of the PE to which the message
is to be sent (15-bit DST id), and the id of this sending PE (15-bit
SRC id).

To achieve high switch bandwidth, the DMA controller interleaves
packets from multiple messages, rather than sending the messages
sequentially. However, all messages from one processing element to
another specific processing element are sent in order. The switch
design ensures that the packets received by a processing element from
another specific processing element are received in the same order in
which they were sent. The Output Message Control Logic pre-fetches all
or portions of the top message for the various destinations into the
Output Message Buffer 1316. From the Output Message Buffer 1316, the
data is taken, one quintword at a time, into the Quintword Disassembly
Buffer 1318 where it is sent, a byte at a time, across to the cluster
controller.
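One way to picture this interleaving is a round robin over the destination queues that emits one packet per turn while keeping only the top message of each queue in flight. The sketch below is an assumption-laden illustration: the generator name `interleave`, the 16-byte payload per packet, and the strict round robin order are hypothetical, and the actual selection policy of the Output Message Control Logic is not specified in the text.

```python
from collections import deque


def interleave(dest_queues, payload_bytes=16):
    """Yield (dst, message_index, packet_number, chunk) in transmission order.

    dest_queues maps a destination PE id to its list of messages (bytes),
    oldest first. Only the top message of each destination is in flight,
    so messages to any one destination are still sent in order.
    """
    active = deque()
    for dst, messages in dest_queues.items():
        if messages:
            active.append((dst, 0, 0))  # (destination, message index, byte offset)

    while active:
        dst, msg_idx, offset = active.popleft()
        message = dest_queues[dst][msg_idx]
        yield dst, msg_idx, offset // payload_bytes, message[offset:offset + payload_bytes]

        offset += payload_bytes
        if offset < len(message):
            # More packets remain in this message: rejoin the rotation.
            active.append((dst, msg_idx, offset))
        elif msg_idx + 1 < len(dest_queues[dst]):
            # Message finished: start the next one for the same destination.
            active.append((dst, msg_idx + 1, 0))


queues = {3: [b"A" * 32, b"B" * 16], 7: [b"C" * 16]}
print([(d, m, n) for d, m, n, _ in interleave(queues)])
# [(3, 0, 0), (7, 0, 0), (3, 0, 1), (3, 1, 0)]
```

Packets for destinations 3 and 7 alternate on the wire, yet the second message for destination 3 does not start until the first has been sent completely.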

As a further function, the DMA controller 1206 also generates a nine
bit SEC/DED Error Correcting Code (ECC) for each packet prior to
transmission.

The error correction function of the present system will now be
described in more detail. As previously explained, as message packets
arrive at a processing element the DMA controller 1206 applies the
ECC, and then performs the function specified by the packet command
field. If the ECC indicates that a single bit error occurred in the
DST id of the received packet, then the packet should have gone to
some other processing element, so the DMA controller 1206 corrects the
DST id and retransmits the packet to the correct processing element.
Where the cluster network is configured with a host processor, the DMA
controller 1206 also reports this error event to a host processor
service subsystem. This is accomplished by generating an interruption
to software on the host processor, which reports the error to the
service subsystem under the control of a thresholding algorithm.
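The correct-then-retransmit behaviour can be sketched with a generic SEC/DED code. The text does not say how the nine check bits are constructed, so the following assumes a standard extended Hamming code over the 171 non-ECC bits (8 Hamming check bits plus one overall parity bit). The field order and all function names (`encode`, `decode`, `receive`, and the helpers) are hypothetical.

```python
FIELDS = (("cmd", 5), ("seq", 8), ("dst", 15), ("src", 15), ("data", 128))


def to_bits(value, width):
    """Big-endian list of 0/1 for an unsigned integer."""
    return [(value >> (width - 1 - i)) & 1 for i in range(width)]


def from_bits(bits):
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def is_data_position(pos):
    """Hamming check bits sit at power-of-two positions; the rest carry data."""
    return pos & (pos - 1) != 0


def encode(data_bits):
    """Extended Hamming code: index 0 is overall parity, 1..n are Hamming positions."""
    r = 0
    while (1 << r) < len(data_bits) + r + 1:
        r += 1
    n = len(data_bits) + r
    code = [0] * (n + 1)
    data = iter(data_bits)
    for pos in range(1, n + 1):
        if is_data_position(pos):
            code[pos] = next(data)
    for i in range(r):
        p = 1 << i
        code[p] = sum(code[pos] for pos in range(1, n + 1) if pos & p) % 2
    code[0] = sum(code) % 2
    return code


def decode(code):
    """Return (status, data_bits); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos
    odd_parity = sum(code) % 2
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity and syndrome <= n:
        code[syndrome] ^= 1       # single-bit error located by the syndrome
        status = "corrected"
    else:
        status = "uncorrectable"  # e.g. a double-bit error
    return status, [code[pos] for pos in range(1, n + 1) if is_data_position(pos)]


def split_fields(data_bits):
    out, start = {}, 0
    for name, width in FIELDS:
        out[name] = from_bits(data_bits[start:start + width])
        start += width
    return out


def receive(my_id, code):
    """Decide what a receiving DMA controller does with an arriving packet."""
    status, bits = decode(code)
    if status == "uncorrectable":
        return "report", None
    fields = split_fields(bits)
    if fields["dst"] != my_id:
        return "retransmit", fields  # misrouted: resend to the corrected DST
    return "deliver", fields


bits = []
for (name, width), value in zip(FIELDS, (1, 0, 3, 40, 0xABCD)):
    bits += to_bits(value, width)
code = encode(bits)
# Flip the DST bit worth 2, so the packet appears addressed to PE 1, not PE 3.
data_positions = [p for p in range(1, len(code)) if is_data_position(p)]
code[data_positions[5 + 8 + 13]] ^= 1
print(len(code), receive(1, code)[0])  # 180 retransmit
```

In this sketch the corrupted packet arrives at PE 1, the decoder restores DST to 3, and the mismatch with the receiver's own id triggers retransmission. With 171 protected bits the extended Hamming construction needs exactly nine check bits, which is consistent with the 180-bit quintword.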

While ECC is generated in the sending processing element and applied
in the receiving processing element, parity checking is also performed
every time a packet enters or leaves a TCM, and upon receipt by the
destination processing element. Thus, correctable errors are detected
and can be reported to the service system as soon as they begin to
occur.

The self correcting error handling of the present system will be
better understood by reference to Fig. 6. We will assume, for example,
that there is a cabling problem between cluster 602(1) and the 32X32
switch network 606, that will cause a hard error in the destination
address field of an incoming packet. We will further assume that the
incoming packet was intended for processing element 604(3) on cluster
controller 602(1) but instead, due to the hard error, arrives at
processing element 604(1) on the same cluster controller.

The receiving processing element 604(1) will receive the incoming
packet by way of the 9-from-6 switch and a PEP output bus. Once the
packet is received, the processing element will correct the
destination field error (using the ECC), and resend the packet on the
cluster controller 602(1) to the correct PE 604(3) by way of the PEP
input bus and the assembly buffer. Since the packet will no longer
travel the path of the problem connection, the hard error will not be
repeated for this packet.

A similar procedure ensures correction of many errors where an
incorrect destination address is caused on the bus from the cluster
controllers to the switch network 606. It will be noted that each
cluster has a separate input bus and output bus. Therefore, if the
destination address of an outgoing packet is altered due to a
misconnection on the output side of the bus and a packet is sent to
the wrong cluster controller, the path between the correct cluster
controller and the receiving/correcting cluster controller will
completely differ from the path between the originating processor and
the receiving/correcting processor.

The switch network 606 itself also includes error correction logic.
Therefore, if a packet is routed to a non-present or non-operational
processing element, the switch will reverse the source and destination
fields and send the packet back to the sender with an error
indication.
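The return-to-sender rule can be expressed in a few lines. This is a sketch under assumptions: the `Packet` dataclass, its boolean `error` attribute, and the `operational_pes` set are hypothetical stand-ins, since the text does not say how the error indication is encoded or how the switch knows which processing elements are present.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    cmd: int
    seq: int
    dst: int
    src: int
    data: bytes
    error: bool = False  # assumption: the error indication is a single flag


def route_or_return(packet: Packet, operational_pes: set) -> Packet:
    """Forward the packet unchanged, or bounce it back to its sender."""
    if packet.dst in operational_pes:
        return packet
    # Destination absent or not operational: swap SRC and DST, flag the error.
    return replace(packet, dst=packet.src, src=packet.dst, error=True)


p = Packet(cmd=1, seq=0, dst=9, src=4, data=b"\x00" * 16)
bounced = route_or_return(p, operational_pes={1, 2, 3, 4})
print(bounced.dst, bounced.src, bounced.error)  # 4 9 True
```

Because the returned packet carries the original destination in its SRC field, the sender can tell which processing element was unreachable.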
Fig. 11 shows a preferred embodiment for a packet format used with the
system of Fig. 6. Each packet is 180 bits wide and includes a 5 bit
command field (CMD), an 8 bit sequence number field (SEQ), a 15 bit
destination address field (DST), a 15 bit source address field (SRC),
a 128 bit data field and a 9 bit error correction code (ECC).

The command field (CMD) includes a five bit 
command that tells the cluster controller and the 
receiving processing element how to handle the 
packet. The sequence number field (SEQ) includes 
an 8 bit packet sequence number sequentially as- 
signed by the originating (source) processing ele- 
ment. The sequence number enables the receiving 
system to identify which packet number of the total 
packet count in the message has been received. 

The destination address field (DST) includes a 
fifteen bit destination processing element number. 
The destination field is used by the switch and 
cluster controller to self route the packet and by 
the receiving (destination) processing element to 
verify that the packet has been routed to the proper 
address. 

The source address field (SRC) includes a fifteen bit originating
(source) processing element number. The source field is used by the
switch and cluster controller to return the packet to the source in a
case where an inoperable or non-present processing element number
appears in the destination address field (DST), and by the receiving
(destination) processing element to properly address any response to
the message or command.

The data field (DATA) includes 128 bits of 
information. The type of information in the data 
field is defined by the command field (CMD). 

The ECC field (ECC) includes an SEC/DED (Single Error Correct/Double
Error Detect) error correction code.
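The field widths listed above sum to 180 bits (5 + 8 + 15 + 15 + 128 + 9), the size of one quintword. A minimal sketch of packing and unpacking such a packet as a single integer follows; the order of the fields within the word and the names `pack` and `unpack` are assumptions, as Fig. 11 is not reproduced here.

```python
FIELDS = (("cmd", 5), ("seq", 8), ("dst", 15), ("src", 15), ("data", 128), ("ecc", 9))
PACKET_BITS = sum(width for _, width in FIELDS)  # 180


def pack(**values) -> int:
    """Pack the named fields into one integer, first field most significant."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word: int) -> dict:
    """Inverse of pack(): peel the fields off from the least significant end."""
    out = {}
    for name, width in reversed(FIELDS):
        out[name] = word & ((1 << width) - 1)
        word >>= width
    return out


packet = pack(cmd=3, seq=17, dst=1234, src=42, data=0xDEADBEEF, ecc=0)
print(PACKET_BITS, unpack(packet)["dst"])  # 180 1234
```

A 15-bit DST or SRC field can name up to 32768 processing elements, which comfortably covers the 4096 mentioned in the abstract.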

For message header packets, the sequence 
field specifies the total length of the message, and 
the DMA controller allocates a message buffer of 
this length in the PE local memory, writes the initial 
quadword of data into the message buffer, and sets 
local hardware pointer, length and sequence regis- 
ters if there will be more packets of data for this 
message. It also constructs the message header in 
memory, which includes the message length, DST
id and SRC id.

For message body packets, the sequence number field is checked against
the sequence register to verify that packets are arriving in order,
and each quadword of data is added to the message buffer. When the
message has been completely received it is enqueued on a queue in
local memory, known as the IN_QUEUE, for processing by the local
processor. If the IN_QUEUE had been empty prior to the addition of
this message, then an interruption is generated to the local processor
to notify it of pending work.
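The header and body handling just described can be sketched as a small state machine. Several details are assumptions made only for illustration: the message length is taken in quadwords, body packets are assumed to be numbered from 1 and to wrap at 256, and the class name `MessageReceiver` and its attributes are hypothetical.

```python
class MessageReceiver:
    """Reassembles one incoming message at a time and feeds an IN_QUEUE."""

    def __init__(self):
        self.in_queue = []       # completed messages awaiting the local processor
        self.interrupts = 0      # interruptions raised to the local processor
        self._buffer = None      # message buffer being filled
        self._remaining = 0      # quadwords still expected
        self._expected_seq = 0   # the "sequence register"

    def header_packet(self, length_quadwords, data):
        # Allocate the buffer, store the initial quadword, arm the registers.
        self._buffer = bytearray(data)
        self._remaining = length_quadwords - 1
        self._expected_seq = 1
        self._finish_if_complete()

    def body_packet(self, seq, data):
        # Check the sequence number against the sequence register.
        if self._buffer is None or seq != self._expected_seq:
            raise ValueError("packet out of order")
        self._buffer += data
        self._remaining -= 1
        self._expected_seq = (seq + 1) % 256
        self._finish_if_complete()

    def _finish_if_complete(self):
        if self._remaining == 0:
            was_empty = not self.in_queue
            self.in_queue.append(bytes(self._buffer))
            self._buffer = None
            if was_empty:
                # Interrupt only when the queue goes from empty to non-empty.
                self.interrupts += 1


rx = MessageReceiver()
rx.header_packet(3, b"a" * 16)
rx.body_packet(1, b"b" * 16)
rx.body_packet(2, b"c" * 16)
print(len(rx.in_queue[0]), rx.interrupts)  # 48 1
```

Raising the interruption only on the empty-to-non-empty transition means a processor that is already draining the queue is not interrupted again for every new message.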

For storage access command packets, the
DMA controller performs the required fetch or store
operation to the PE local memory (transferring a
doubleword of data), and for fetches a response
packet is constructed by reversing the SRC and
DST id fields, and then sent through the
switch to return the requested doubleword of data.

Packets that contain global storage access
commands are handled in the cluster controller in
the same way that local storage access commands
are handled by the DMA controllers. In both cases,
the memory operations are autonomous, and
include a compare-and-swap capability.
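A minimal sketch of storage access handling follows, with the response built by exchanging the SRC and DST ids. The operation names, the dictionary packet layout, and the choice to answer a compare-and-swap with the previous value are assumptions made for the illustration; the description specifies a response packet only for fetches.

```python
def handle_storage_access(mem, packet):
    """Apply one storage access packet to `mem` (a dict of address -> doubleword).

    Returns a response packet for fetch and compare-and-swap, None for store.
    """
    op, addr = packet["op"], packet["addr"]
    if op == "store":
        mem[addr] = packet["data"]
        return None
    if op == "fetch":
        value = mem.get(addr, 0)
    elif op == "cas":
        # Compare-and-swap: write only if the current value matches `expected`.
        value = mem.get(addr, 0)
        if value == packet["expected"]:
            mem[addr] = packet["data"]
    else:
        raise ValueError("unknown storage operation: " + str(op))
    # Reverse SRC and DST so the response travels back to the requester.
    return {"src": packet["dst"], "dst": packet["src"],
            "op": "response", "addr": addr, "data": value}
```

Because the whole operation is carried out at the memory that owns the address, a requester sees either the old or the new value and never a partial update, which is what makes compare-and-swap usable for synchronization between processing elements.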

Fig. 14 depicts a preferred layout of a processing
element/cluster board. In terms of physical layout,
a cluster preferably comprises a multilayer
circuit board 1400 on which up to 64 processing
element cards (i.e. circuit boards which each
embody a processing element) are mounted directly,
and at least one cluster controller thermal conduction
module (TCM) 1402. Each cluster controller
handles local message passing within the cluster,
and connects to the switch network 606.

Fig. 15 shows a preferred system frame layout
with four clusters in each of eight frames 1502-
1516. The switch network thermal conduction modules
are preferably embodied in central frames
1518-1524. The Host Adapter 1700 (Fig. 17) can
reside in any one of the switch network frames
1502-1516. For availability and configurability
reasons, an additional Host Adapter 1700 can be
provided in another one of the switch network frames
1502-1516.

Fig. 16 shows a preferred layout for a processing
element card 1600, including the high performance
RISC microprocessor 1202, the optional
database accelerator 1602, the DMA controller
1206, and the local memory 1210. The processing
element cards 1600 have twice as many pins as
can be connected to the cluster controller TCM.
Therefore, a second set of PE buses (a second
"PE port") is brought off the processing card and
onto the mother board (the TCM mother board)
where it is routed to the second (spare) cluster
controller TCM position (1404, Fig. 14). This allows
for future expansion: as CMOS densities continue
to improve, a second PE could be packaged per
card, and duplicate cluster controller and switch
network TCM's could be plugged into the pre-wired
boards, doubling the size of the system to 4096
PEs. Alternatively, with the optional cluster controller
and switch network TCM's plugged, each PE
could use two PE ports to the cluster controller,
either for higher bandwidth or for improved fault
tolerance.
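The system sizes quoted for Figs. 14 to 16 can be checked with a few lines of arithmetic. The constant names are illustrative; the values are the ones given in the description (64 processing element cards per cluster, four clusters per frame, eight frames, and fifteen bit PE numbers).

```python
PES_PER_CLUSTER = 64     # processing element cards per cluster board
CLUSTERS_PER_FRAME = 4   # clusters in each frame
PE_FRAMES = 8            # frames 1502-1516
PE_NUMBER_BITS = 15      # width of the DST and SRC fields

clusters = CLUSTERS_PER_FRAME * PE_FRAMES   # clusters in the frame layout
base_pes = clusters * PES_PER_CLUSTER       # one PE per card
expanded_pes = 2 * base_pes                 # two PEs per card after expansion
addressable = 2 ** PE_NUMBER_BITS           # PE numbers a 15-bit field can express

print(clusters, base_pes, expanded_pes, addressable)
```

The layout of Fig. 15 therefore holds 32 clusters and 2048 PEs with one PE per card, and packaging a second PE per card yields the 4096 PEs mentioned above, which is still well within the range of a fifteen bit PE number.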

The above-described system can be built as a
stand-alone multiprocessing system, as a stand-alone
database processor or employed as a
coprocessor to a traditional mainframe host. In the
latter case, the host system would provide the
front-end MVS/DB2 system functions, including
session management, transaction processing,
database locking and recovery. The present
multiprocessor system could also be employed as a
back-end system to offload and accelerate the
read-only complex query processing functions from
the host.

Many modifications and variations which can
be made without departing from the scope and
spirit of the invention will now occur to those of skill
in the art. It should thus be understood that the
present description of the system is provided as an
example and not as a limitation.

Claims

1. A multiprocessing system having at least three
nodes, comprising:

a first node (604) comprising means to transmit
data comprising destination identifying
information;

a second node (604) coupled to said first node
(604), said second node comprising means for
receiving said data along a first path, means
(1206) for detecting and correcting an error in
said destination identifying information so as to
form corrected destination identifying information
and means for rerouting said data, along a
second independent path, to a third node (604)
identified by said corrected destination
identifying information.

2. The multiprocessing system of claim 1 wherein
said first, second and third nodes are processors
and wherein said processors are each
coupled to a self routing switch (606) by way
of independent input (608, 612) and output
(610, 614) data paths.

3. The multiprocessing system of claim 2 wherein
said data is packetized and comprises a
destination field including said destination identifying
information and a source field identifying a
source processor.

4. The multiprocessing system of claim 2 wherein
said switch (606) comprises means for detecting
when said destination field identifies a
non-present processor, for reversing said source
and destination fields and for rerouting said
data to said source processor.

5. A method of error recovery in a multiprocessor
system connected by a switching network,
wherein a first processor (604) in said system
transmits a data packet having an address field
comprising an address of a second processor
(604) in said system, comprising the steps of:

transmitting said packet from said first (604) 
processor to said switching network (606) by 
way of a first path; 

decoding said address field in said transmitted 
packet at said switching network (606); 

routing said packet by way of a first path from 
said switching network (606) to a third proces- 
sor (604) in said system designated by said 
decoding; 

detecting at said third processor (604), an error 
in said address field of said packet; 

correcting said error at said third processor 
(604) to form a corrected address in said ad- 
dress field; 

retransmitting said packet having said correct- 
ed address from said third processor (604) to 
said switching network (606) by way of a sec- 
ond path; 

decoding said address field in said retransmit- 
ted packet at said switching network (606); and 

routing said retransmitted packet from said 
switching network to said second processor 
(604) by way of a third path. 

6. The method of claim 5 comprising the further
steps of:

determining at said switch (606) if a decoded
address corresponds to a non-operable
processor; and

when said determining determines that said
decoded address corresponds to said
non-operable processor, causing said switch (606) to
exchange said source and destination fields in
said packet and returning said packet to said
first processor (604) by way of a fourth path.

7. A cluster controller (602) for use in a
multiprocessor system comprising a plurality of
processor clusters coupled by way of a switching
network, said cluster controller (602) comprising:

switching means (1002), connected to receive
packets from said switching network, for
distributing said packets from said switching
network in accordance with a destination address;

global storage means (1004) for storing data, 
said global storage means being connected to 
receive said packets from said switching 
means (1002);

queue means (1008) for buffering packet flow 
to a plurality of processors, said queue means 
comprising a plurality of packet queues asso- 
ciated with each of said processors; 

a plurality of first busses, each of said first 
busses being connected to an output port of 
said switching means and an input port of one
of said packet queues, said first busses having 
a first number of bits; 

a plurality of processing element port means
(612, 614) for transferring data between said 
cluster controller and said processors; 

a plurality of second busses, each of said 
second busses being connected to an output 
port of one of said packet queues and an input 
port (612) of one of said processing element 
port means;

assembly buffer means (1006) for assembling 
data from said processors into packets, said 
assembly buffer means (1006) comprising one 
assembly buffer (1014) for each of said pro- 
cessors and round robin means for selecting 
an assembled packet to be output, said as- 
sembly buffer means (1006) being connected 
to receive said data from said processing ele- 
ment ports; and 

selector means (1016) for selecting one packet
to be sent to said switching network (606), said
selector means being connected to receive
packets from said assembly buffer means
(1006) and said global store means (1004).

8. The system of claim 7 wherein said selector
means (1016) further comprises selector 
means for outputting said packets in round 
robin fashion. 

9. The system of claim 7 wherein said second 
busses have a second number of bits larger 
than said first number of bits.

10. A cluster connected multiprocessing system
comprising:

a first plurality of processors (604), wherein
each of said processing elements in said first
plurality comprises a local memory;

a second plurality of processors (604), wherein
each of said processing elements in said second
plurality comprises a local memory;

first cluster controller means (602) connected
to receive first data from said first plurality of
processors (604), for assembling said first data
into packets comprising a source field, a
destination field and a command field, and for
outputting said first plurality of packets;

second cluster controller means (602) connected
to receive second data from said second
plurality of processors (604), for assembling
said second data into packets comprising a
source field, a destination field and a command
field, and for outputting said packets;
and

switching network means (606) connected to
receive said packets from said first and second
cluster controller means (602), for decoding
said destination field and for determining which
one of said cluster controller means is
connected to an addressed processor
corresponding to said decoded destination field and
for routing said packets to said one of said
cluster controller means (602).

11. The system of claim 10 wherein said first and
second cluster controller means (602) each
comprise means for outputting said packets in
round robin fashion.

12. The system of claim 10 wherein each of said
processors (604) in said first and second
plurality comprises means for providing direct
access to said local memory by every other
processor in said first and second plurality.

13. The system of claim 10, further comprising:

host adaptor means (1700), for coupling a host
processor to said switching network means,
said host adaptor means comprising:

means (1714) for receiving a set of commands
from said host processor; and

means (1710) for distributing said commands
among a plurality of said processors.

14. The system of claim 13 wherein said host
adaptor means further comprises:

means (1706) for translating first memory
addresses from said host processor to a band of
second memory addresses in a local memory
within each of said plurality of said processors.

15. The system of claim 14 wherein each of said 
processors comprises a general purpose pro- 
cessor and a database accelerator. 

16. The system of claim 15 wherein at least one of 
said database accelerators is a sort coproces- 
sor. 

17. A cluster controller (602) for use in a
multiprocessor system comprising a plurality of
processing element clusters coupled by way of
a switching network (606), said cluster controller
(602) comprising:

switching means (1002), connected to receive
packets from said switching network (606), for
distributing said packets from said switching
network (606) in accordance with a destination
address;

queue means (1008), coupled to a plurality of
said processing elements (604), for buffering
packet flow to said plurality of processing
elements, said queue means comprising a plurality
of packet queues associated with each of
said processing elements;

assembly buffer means (1006), coupled to said
plurality of said processing elements (604), for
assembling data from said processing elements
(604) into packets, said assembly buffer
means (1006) comprising one assembly buffer
(1014) for each of said processing elements
(604); and

selector means (1016), coupled to said assembly
buffer means (1006) for selecting a packet
from said assembly buffer means (1006) to be
sent to said switching network (606).

18. A packet format for use in a cluster connected
multiprocessing system comprising:

a command field comprising:

a first defined pattern of bits which when
decoded by a cluster controller within said
multiprocessing system will cause a write to a
global memory within said cluster controller;

a second defined pattern of bits which identifies
a packet including said command field as
carrying a message body;

a third defined pattern of bits which identifies a
packet including said command field as carrying
a message header;

a sequence number field for carrying any of a
sequence number of a packet where said command
field defines said packet as a message
body, and a count of message packets to
follow where said command field defines said
packet as a message header;

a destination field for carrying a first address
of a destination processing element in said
cluster connected multiprocessing system;

a source field for carrying a second address of
a source processing element in said cluster
connected system;

a data field; and

an error correction code field for carrying an
error correct, error detect correction code.

19. The packet format of claim 18, wherein said
command field further comprises a fourth defined
pattern of bits which when decoded by
said cluster controller will cause a local memory
in a processor connected to said cluster
controller to be accessed.






EP 0 439 693 A2 



FIG. 1 (PRIOR ART)

FIG. 3 (PRIOR ART)
[Drawing labels: CPU and MEMORY blocks; reference numerals 302, 304, 306, 308]


FIG. 4 (PRIOR ART)

FIG. 5 (PRIOR ART)
[Drawing labels on this sheet: MEMORY, CPU, CONNECTION NETWORK, CLUSTER 1, CLUSTER 2, CLUSTER 3, CLUSTER 64, PROC 1, PROC 2, PROC 3, PROC 8, MULTIPLEXER, PORT, PORT 2, PORT 3, PORT 64, 64 X 64 SWITCH NETWORK; reference numerals 402, 406, 408, 502, 504, 506, 508', 510, 512', 514', 516', 518, 520]


FIG. 8 







[Figure: legible drawing labels include SHORT CIRCUIT and INPUT PORT REGISTERS (QUINTWORD EACH)]




FIG. 11

PACKET FORMAT

CMD | SEQ | DST | SRC | DATA | ECC
 5  |  8  | 15  | 15  | 128  |  9

FIELD  SIZE  MEANING

CMD    5     COMMAND FIELD:

             00001 - MESSAGE HEADER; DATA FIELD
                     HOLDS FIRST 16 BYTES OF
                     MESSAGE; SEQ FIELD HOLDS
                     COUNT OF MESSAGE BODY
                     PACKETS THAT FOLLOW (TOTAL
                     MESSAGE IS 1-256 PACKETS,
                     16-4096 BYTES)

             00010 - MESSAGE BODY

             00011 - CONTROL FUNCTION, DATA FIELD
                     PROVIDES SUBCOMMAND

             00100 - GLOBAL STORAGE ACCESS, DATA
                     FIELD PROVIDES OP/ADDR/DATA

             00101 - PE STORAGE ACCESS, DATA FIELD
                     PROVIDES OP/ADDR/DATA

             10000 - GLOBAL BROADCAST TO ALL PE's

             10001 - BROADCAST ALL PE's ON SI
                     ADDRESSED BY DST, PER MASK
                     IN DATA FIELD (0-63); DATA
                     FIELD (64-127) HOLDS MESSAGE

SEQ    8     SEQUENCE NUMBER IN MESSAGE BODY
             PACKETS; IN MESSAGE HEADER PACKETS,
             COUNT OF MESSAGE BODY PACKETS
             TO FOLLOW

DST    15    DESTINATION PE NUMBER

SRC    15    SOURCE PE NUMBER

DATA   128   DATA CONTENT FOR DATA PACKETS;
             SUBCOMMAND FOR CONTROL PACKETS;
             OPERATION TYPE, ADDRESS AND DATA
             FOR STORAGE ACCESSES

ECC    9     SEC/DED ERROR CORRECTION CODE

180 BITS PER PACKET
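The packet layout of FIG. 11 can be modelled as a single 180-bit integer. The sketch below is illustrative: the bit ordering (CMD in the most significant bits) is an assumption, and the 5-bit CMD width is inferred from the five-bit command codes and from the 180-bit total rather than stated in the legible part of the figure. The function and table names are hypothetical.

```python
# (field name, width in bits), in assumed transmission order
FIELDS = [("cmd", 5), ("seq", 8), ("dst", 15), ("src", 15), ("data", 128), ("ecc", 9)]

# Command codes as listed in FIG. 11
COMMANDS = {
    0b00001: "MESSAGE HEADER",
    0b00010: "MESSAGE BODY",
    0b00011: "CONTROL FUNCTION",
    0b00100: "GLOBAL STORAGE ACCESS",
    0b00101: "PE STORAGE ACCESS",
    0b10000: "GLOBAL BROADCAST",
    0b10001: "BROADCAST PER MASK",
}


def pack(**values):
    """Pack named field values into one integer; missing fields default to 0."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word):
    """Split a packed integer back into a dict of field values."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields
```

For example, `unpack(pack(cmd=0b00001, seq=2, dst=7, src=3))` returns the same four values with `data` and `ecc` equal to zero, and `COMMANDS[0b00001]` identifies the packet as a message header.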